Abstract

We introduce a new regression problem which we call the Sum-Based Hierarchical Smoothing 
problem. Given a directed acyclic graph and a non-negative value, called target value, for each 
vertex in the graph, we wish to find non-negative values for the vertices satisfying a certain 
constraint while minimizing the distance of these assigned values and the target values in the 
$\ell_p$-norm. The constraint is that the value assigned to each vertex should be no less than the
sum of the values assigned to its children. We motivate this problem with applications in 
information retrieval and web mining. While our problem can be solved in polynomial time 
using linear programming, given the input size in these applications such a solution is too slow. 

We mainly study the $\ell_1$-norm case, restricting the underlying graphs to rooted trees. For this
case we provide an efficient algorithm, running in $O(n^2)$ time. While the algorithm is purely
combinatorial, its proof of correctness is an elegant use of linear programming duality. We also
present a number of other positive and negative results for different norms and certain other
special cases.

We believe that our approach may be applicable to similar problems, where comparable 
hierarchical constraints are involved, e.g. considering the average of the values assigned to the 
children of each vertex. While similar in flavour to other smoothing problems like Isotonic 
Regression (see, for example, [Angelov et al. SODA'06]), our problem is arguably richer and
theoretically more challenging. 



* Department of Computer Science, University of Toronto 
† Thoora Inc., Toronto, ON, Canada

‡ This research was supported by the MITACS Accelerate program, Thoora Inc., and The University of Toronto,
Department of Computer Science.



1 Introduction 

The prevalence of popular web services like Amazon, Google, Netflix, and StumbleUpon has given 
rise to many interesting large-scale problems related to classification, recommendation, ranking, 
and collaborative filtering. In several recent studies (e.g. [KFB09, PG08, CKP07]), researchers
have incorporated the underlying class hierarchies of the data-sets into the setting of recommendation
systems. Moreover, Koren et al. [DKK11] recently demonstrated an application of hierarchical
classifications of topics, i.e. taxonomies, in Collaborative Filtering settings, in particular, music
recommendation. In these application scenarios, the taxonomies are abstracted as trees. Associated
with the vertices are scalar target values, typically inferred through the use of various machine 
learning or information retrieval methods. For instance, given a hierarchy of topics and a search 
query, the target values could be the relevance measures of the topics to the search query. 

When a taxonomy is used, one would usually like to enforce particular constraints on the value 
assigned to the vertices to properly represent the hierarchical relationship among them. Typically, 
the relevant machine learning approaches are ill-equipped to handle these requirements. Often, 
these constraints state that the value of each vertex should be at least some function of the value 
of its direct children in the taxonomy (e.g. [PG08, CKP07]).

Going back to the previous example of topics and search query, imagine that the taxonomy 
contains the topics sports, baseball, football, and basketball with the first topic being the parent 
of the other three and that the search query is "ESPN". One would like to find the relevance of 
this query to every topic in the taxonomy. A reasonable requirement of these relevance values 
would be that the relevance of "ESPN" to sports would be no less than the sum of its relevance to 
baseball, football, and basketball. One way to solve this problem would be to directly impose such 
a constraint on the learning algorithm that infers the relevance values using regularization; i.e. 
adding an additional term in the objective function of that algorithm penalizing any violation of 
the constraint. However, this approach has two problems. First, it "softens" our requirements; i.e.
it allows for possible violations, to some limited extent. Second, it can dramatically increase
the running time of the learning process or restrict our choice of the learning algorithm.

Instead, we take the following, widely used, two-step approach. Given a search query s, we 
first infer each of the relevance scores of each of the topics, disregarding the hierarchy constraints. 
Then, we smoothen the inferred relevance scores by modifying them so as to uphold the above sum 
constraint. We would want the change of the relevance scores in the second step to be as small 
as possible. As the relevance scores are scalar values, we can represent both the original and final 
relevance scores as two vectors with non-negative values, and measure their difference in a suitable 
norm (e.g. the $\ell_1$, $\ell_2$ or $\ell_\infty$ norms). The subject of this paper is how to perform the second
step. 

We formulate this problem, which we call the Sum-Based Hierarchical Smoothing problem
(SBHSP), as follows. Given a rooted tree (or in general a directed acyclic graph) $G = (V, E)$
and a vector of original vertex values (called target values) $a = (a_{v_1}, a_{v_2}, \ldots, a_{v_n})$, the objective is to
find a vector of new vertex values (called assigned values) $x = (x_{v_1}, x_{v_2}, \ldots, x_{v_n})$ with the following
properties: (i) for any node $w$ with incoming edges $(u_1, w), \ldots, (u_k, w)$ we have $x_{u_1} + \cdots + x_{u_k} \le x_w$;
(ii) $\|a - x\|_p$ is minimized. Different values of $p$ result in different variants of the problem. We
mainly study the problem for $p = 1$ and $p = \infty$, but the case of $p = 2$ is also interesting. It is not
hard to see that for $p = 1$ the problem can be solved in polynomial time using linear programming
(see inequalities (3a)-(3d)) and for $p > 1$ it can be solved by using a suitable separation oracle and
the Ellipsoid method. However, given the typical size of taxonomies, these solutions are too slow.
We note that this problem seems to be more complex than other previously considered similar
problems as the assigned value of each vertex affects the possible values for any vertex it shares a 
parent with. In particular, to the best of our knowledge techniques used for similar problems are 
ineffective for it. 
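
To make the two requirements of the formulation concrete, the following minimal Python sketch checks the sum constraint and evaluates the $\ell_p$ distance. The function names, the dictionary-based tree encoding, and the numeric relevance scores are assumptions made for illustration only; they do not come from the paper.

```python
import math

def is_feasible(children, x, tol=1e-9):
    """Check x >= 0 and x[w] >= sum of x over the children of w, for every w."""
    if any(value < -tol for value in x.values()):
        return False
    return all(
        x[w] + tol >= sum(x[u] for u in children.get(w, []))
        for w in x
    )

def lp_distance(a, x, p):
    """Return ||a - x||_p; pass p=math.inf for the max norm."""
    diffs = [abs(a[v] - x[v]) for v in a]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Hypothetical scores for the "ESPN" example: sports is the parent of three topics.
children = {"sports": ["baseball", "football", "basketball"]}
a = {"sports": 0.5, "baseball": 0.3, "football": 0.3, "basketball": 0.2}
# One feasible (not necessarily optimal) smoothing: raise the parent to the sum.
x = {"sports": 0.8, "baseball": 0.3, "football": 0.3, "basketball": 0.2}
```

Here the target vector `a` violates the constraint at `sports`, while `x` satisfies it at an $\ell_1$ cost of about 0.3.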

Contributions: Our main contribution is a purely combinatorial algorithm for the case where the input is a
rooted tree and $p = 1$ (i.e. the $\ell_1$ norm) that runs in time $O(n^2)$. We note that the $\ell_1$ norm was
previously used as a good measure of difference in similar regression problems (e.g. see [AHKW06]).
As many hierarchical structures in practice are trees, our algorithm can be used in many practical
applications. Our second contribution is a linear time algorithm for the case $p = \infty$ which works
for any directed acyclic graph. We also show an efficient FPTAS for optimizing the $\ell_1$ norm for
another class of DAGs (directed bilayer graphs). Finally, we show that if one adds the extra
condition that the assigned values should be integral, the problem is hard to approximate (to within
a polylogarithmic factor) for any $\ell_p$ norm with $1 \le p < \infty$. Interestingly, given that our algorithm
for the $\ell_1$ norm on trees always outputs an integral solution, this last result suggests that new ideas
are needed to extend it to general DAGs.

Our algorithm for the $\ell_1$ case has a rather simple structure. We assign values to the vertices of
the tree in a bottom-up manner. For each vertex we first assign a valid (but possibly suboptimal)
value and then use paths going down from that vertex to "push the excess" down the tree and
improve the objective value. While the algorithm is purely combinatorial, its proof of correctness
is an elegant use of linear programming duality. In particular, we use the complementary slackness
conditions to show that if the algorithm can no longer push the excess of a node down the tree, the
values assigned to its subtree must be optimal.

Organization: We present the relevant previous work in Section 2. In Section 3 we present a
precise definition of the problem and some preliminaries. We present our first algorithm, which is
for the case of trees and the $\ell_1$ norm, in Section 4 and prove its correctness. In Section 5 we show how
this algorithm can be optimized to run in the promised $O(n^2)$ time. We conclude and propose
several open problems in Section 6. We extend the algorithm to the case of the weighted $\ell_1$ norm in
Appendix A. We present our algorithm for the case of $\ell_\infty$ in Appendix B. We leave our hardness of
approximation result to Appendix C and our results for the case of bilayer graphs to Appendix D.

2 Previous Work 

The main motivation of the current paper is the application of taxonomies in regression. A recent 
example, studied by Koren et al. [DKK11], is the application of topic hierarchies in the context
of collaborative filtering. They provide a method of linking the data-set to a four level taxonomy, 
which helps them circumvent difficulties related to the size of the data-set. 

Regression and smoothing problems have been studied extensively in recent years. Perhaps the 
most relevant problem to our setting is the Isotonic regression problem and its variants. There 
one wishes to find a closest fit to a given vector subject to a set of monotonicity constraints. More
precisely, let $a = (a_1, \ldots, a_n)$ be $n$ target values, and let $E$ be a set of $m$ pairwise order constraints
on these variables. The Isotonic regression problem is to find values $x = (x_1, \ldots, x_n)$ such that
$x_i \ge x_j$ whenever $(i, j) \in E$ and for which the distance between $x$ and $a$ is minimized. To put things in
a language similar to ours, in isotonic regression the assigned value of each vertex should be at least
the maximum of the assigned values of its children, as opposed to the sum of those values in
our problem. 

Common choices of distance functions include the weighted $\ell_1$, $\ell_2$ and $\ell_\infty$ norms. The Isotonic
regression problem for such weighted norms has been studied extensively. For some of the results
for the $\ell_1$ and $\ell_2$ norms see [Sto08, AHKW06, BC90]. Stout also maintains a web site containing
some of the fastest known Isotonic regression algorithms for different settings at [Sto].




The Isotonic regression problem belongs to a more general class of problems known as order 
restricted statistical inference. Order restricted statistical inference was first studied by Barlow et
al. [BBBB72]. The Isotonic regression problem became popular since it has many applications in
testing [LB01, M^C07], modelling [MJDP+00, Ulm86], data smoothing [FT84, PG08] and other
areas [RWD88] related to statistical and computational data analysis. It has been shown to be an
important post-processing smoothing tool to impose desired hard constraints on the values that a
learning algorithm has produced. Variations of Isotonic regression have been used for other applications
like template learning [CKP07], ranking [PCZ+10, MSCZ10], and classification [KFB09].

3 Preliminaries 

We now formally define the problem as follows. Given a tree (or DAG) $T = (V, E)$ rooted at node
$r \in V$, and a vector $a \in \mathbb{R}^n_{\ge 0}$ of the target values of the vertices, we wish to find the closest vector
$x \in \mathbb{R}^n_{\ge 0}$ in the $\ell_p$-norm, so that for each node $v$ with children $u_1, \ldots, u_k$, $x_v \ge x_{u_1} + x_{u_2} + \cdots + x_{u_k}$.

While most of the paper addresses the case of $p = 1$, we also discuss the case of $p = \infty$ in
Appendix B. Note that our hardness results apply to all $1 \le p < \infty$.

For a vertex $u \in T$, we call the set of nodes with edges to $u$ the children of $u$, denoted $C(u)$;
similarly, the parent of $u$ is $A(u)$ (in the case where the underlying graph is a general DAG, $A(u)$
will be a set of nodes). Throughout the paper, we will make extensive use of various paths in the
given tree. For this purpose, we let $P_{u \to v}$ denote the (unique) path from vertex $u$ to vertex $v$ in
$T$. We denote the sub-tree rooted in vertex $v$ by $T_v$. For a given sub-tree $T_v$, we define $a|_{T_v}$ as the
vector of target values corresponding to the nodes in $T_v$; we similarly define $x|_{T_v}$.

4 The Algorithmic Approach for $\ell_1$

As an initial attempt, consider the following trivial feasible solution. For each leaf $i \in T$, set
$x_i = a_i$. Then, for each internal node $v$ set $x_v = \max(a_v, \sum_{u \in C(v)} x_u)$, by traversing the tree in
post-order. However, it is not hard to see that this approach can be arbitrarily sub-optimal (see
Figure 1). Indeed, in some cases it is preferable to lower the existing $x$ values of a given node's
children, instead of raising the node's $x$ value, as this might help the objective value on the node's
ancestors as well.
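
The trivial bottom-up assignment can be sketched in a few lines of Python. The function names and the chain instance below are illustrative assumptions (the instance is not necessarily the one in the paper's figure); the example only shows that the baseline's cost can grow with the depth of the tree while a cheaper feasible assignment exists.

```python
def naive_assignment(children, a, root):
    """Bottom-up baseline: x[v] = max(a[v], sum of x over the children of v)."""
    x = {}
    # Iterative post-order traversal, so deep trees do not hit the recursion limit.
    stack = [(root, False)]
    while stack:
        v, expanded = stack.pop()
        if expanded:
            x[v] = max(a[v], sum(x[u] for u in children.get(v, [])))
        else:
            stack.append((v, True))
            stack.extend((u, False) for u in children.get(v, []))
    return x

def l1_cost(a, x):
    """Sum of |a_v - x_v| over all vertices."""
    return sum(abs(a[v] - x[v]) for v in a)

# Hypothetical chain 0 -> 1 -> 2 -> 3 in which only the leaf has a positive target.
children = {0: [1], 1: [2], 2: [3]}
a = {0: 0, 1: 0, 2: 0, 3: 1}
x_naive = naive_assignment(children, a, 0)   # every vertex is raised to 1
x_zero = {v: 0 for v in a}                   # lowering the leaf instead
```

On this chain the baseline pays 3 (one unit per ancestor of the leaf), whereas setting every value to 0 is feasible and pays only 1; lengthening the chain widens the gap.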

In order to optimize the objective function, our algorithm will proceed as follows. By traversing
the tree $T$ in post-order, it performs the following sequence of steps for every vertex $v$. First, $x_v$ is initially
set to the maximum of $a_v$ and the sum of the $x$ values of its children, which is clearly a feasible
assignment for $T_v$. It then improves the assignments for $T_v$ by sequentially decreasing the values of
some vertices that are located on some path $P$ from $v$ to some other node in $T_v$. The adjustments
are made so that the overall improvement in the objective function equals the improvement in
$|a_v - x_v|$. We will refer to such paths as push-paths, and the improvements made on them as push
operations. The algorithm is presented below as Algorithm 1. The procedure Push-Path$(x, P, \epsilon)$
checks what the improvement in the objective function value is if we reduce the $x$ value of all
vertices in the path $P$ by $\epsilon$. This path will always start at the current vertex $v$.

For now we do not discuss how to find the push path or the exact value that we push down that 
path. This abstraction was made deliberately, so as to separate the correctness of the algorithm
from its performance. In fact, we later show that the individual paths need not be enumerated 
separately. 
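
As a rough illustration of a single push operation, the sketch below lowers every value on a path by a fixed amount and reports the change in the $\ell_1$ objective on that path. It is a simplified reading of the procedure described above, not the paper's implementation: the name `push_path`, the dictionary encoding, and the in-place update are assumptions, and feasibility of the result is left to the caller.

```python
def push_path(x, a, path, eps):
    """Lower x by eps on every vertex of `path` (listed top to bottom), in place.

    Returns old - new, the decrease of sum |x - a| over the path.
    """
    old = sum(abs(x[v] - a[v]) for v in path)
    for v in path:
        x[v] -= eps
    new = sum(abs(x[v] - a[v]) for v in path)
    return old - new

# Hypothetical chain 1 -> 2 -> 3; targets are 0, 0, 1 and every x starts at 1.
a = {1: 0, 2: 0, 3: 1}
x = {1: 1, 2: 1, 3: 1}

# Starting at vertex 2, the gain on 2 is cancelled by the loss on 3.
gain_from_2 = push_path(dict(x), a, [2, 3], 1)      # evaluated on a copy
# Starting at vertex 1, two vertices gain and one loses, so the net gain is eps.
gain_from_1 = push_path(x, a, [1, 2, 3], 1)
```

In this instance the path from vertex 2 yields no improvement, while the path from vertex 1 improves the objective by exactly the pushed amount, which is the condition the while-loop of the algorithm looks for.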
Algorithm 1: Push-Improve

Input: Undirected tree $T = (V, E)$, with a vector of vertex weights $a \in \mathbb{R}^n_{\ge 0}$
Output: A feasible vector of weights $x \in \mathbb{R}^n_{\ge 0}$ for $V$

Let $v_1, v_2, \ldots, v_{n-1}, v_n$ be the vertices in $T$ sorted in post-order.
for $i \leftarrow 1$ to $n$ do
    $x_{v_i} \leftarrow \max(a_{v_i}, \sum_{u \in C(v_i)} x_u)$
    ImproveSubtree($v_i$)
end

ImproveSubtree(Vertex $u$)
    while $\exists$ path $P$ from $u$ down to a vertex $v$, and $\epsilon > 0$ such that $v$ is either a leaf or
          $x_v > \sum_{w \in C(v)} x_w$, and Push-Path($x, P, \epsilon$) $= \epsilon$ do
        Push-Path($x, P, \epsilon$)
    end

Push-Path(Assignment $x$, Path $P$, Non-negative real value $\epsilon$)
begin
    Let $v_1, \ldots, v_k$ be the sequence of nodes on $P$ from top to bottom.
    old $= \sum_{1 \le i \le k} |x_{v_i} - a_{v_i}|$
    for $i = 1$ to $k$ do
        if $\epsilon > 0$ then
            $x_{v_i} \leftarrow x_{v_i} - \epsilon$
        end
    end
    new $= \sum_{1 \le i \le k} |x_{v_i} - a_{v_i}|$
    return old $-$ new
end



The following theorem states that the output of Algorithm 1 is optimal.

Theorem 4.1. When Algorithm 1 terminates, the obtained vector $x$ is a feasible and optimal
assignment for the given tree $T$.

Our proof of Theorem 4.1 will proceed as follows. We begin by characterizing the necessary
push-path improvement at each step of the while-loop. We then inductively argue that before and
after each push operation, the value of the objective function for each sub-tree rooted in a child
of the current node remains optimal. We conclude by using an LP duality argument in order to
show that once no more push operations exist for the current vertex $v$ in the for-loop, $T_v$ is assigned
optimal $x$ values.

The following lemma refers to the series of improvements performed on node $v$, and can be
viewed as the set of invariants of the outer for-loop.

Lemma 1. Let $v$ be the current node, and let $P = (v = u_0, \ldots, u_k)$ be a push-path such that for $1 \le i \le k$,
$u_i \in C(u_{i-1})$. Then the following invariants hold throughout the execution of the inner while-loop:

1. If, for $\epsilon > 0$, Push-Path$(x, P, \epsilon) > 0$, then Push-Path$(x, P, \epsilon) \le \epsilon$. Furthermore, if
for path $P$ and $\epsilon > 0$, Push-Path$(x, P, \epsilon) = \delta > 0$, then there exists $\epsilon' > 0$ such that
Push-Path$(x, P, \epsilon') = \epsilon'$.

2. If for path $P$ and $\epsilon > 0$, Push-Path$(x, P, \epsilon) = \epsilon$, then for each $u \in C(v)$, $T_u$ is optimally set
before and after running Push-Path$(x, P, \epsilon)$.

Proof. First, notice that the above invariants clearly hold if the current node $v$ is a leaf, as the
initial $x$ values of leaves are set to their $a$ values, and will only be modified as a result of performing
Push-Path on their ancestors. Assume that the invariants hold for all nodes preceding $v$ in the
post-order, and suppose for contradiction that there exists some path $P = (v = u_0, \ldots, u_m)$ and
$\epsilon > 0$ such that Push-Path$(x, P, \epsilon) > \epsilon$. The first part of the first invariant clearly holds since
the sub-trees rooted in the children of $v$ are assumed optimal. Hence, any $\epsilon$-improvement on $v$
cannot entail an additional improvement on the rest of the push-path.

We now consider the second invariant, while briefly deferring the proof of the second part of the
first invariant. Let $P$ be a push-path, and $\epsilon > 0$ such that Push-Path$(x, P, \epsilon) = \epsilon$. First, notice
that for each $i \in C(v) \setminus \{u_1\}$, the assignments to $T_i$ do not change. On the other hand, notice
that $|x_v - a_v|$ is reduced by exactly $\epsilon$. This implies that $\|x|_{T_{u_1}} - a|_{T_{u_1}}\|_1$ remains unchanged, thereby
remaining optimal.

We now turn to the remaining part of the first invariant. Consider a push-path $P$ and
$\delta > 0$. By the first part of the invariant, Push-Path$(x, P, \delta) \le \delta$. If Push-Path$(x, P, \delta) = \delta$,
then the claim holds trivially. Hence, assume Push-Path$(x, P, \delta) < \delta$.

We restrict ourselves to dealing with $\delta$ values in the range $(0, x_v - a_v]$. The following observation
stems from the fact that during the push operation, $x$ values along $P$ only decrease.

Observation 1. For path $P$ and $\epsilon > 0$, if Push-Path$(x, P, \epsilon) > 0$ then

$|\{j \in P : x_j > a_j\}| > |\{j \in P : x_j \le a_j\}|$ (1)

In fact, using the induction hypothesis, we can make Observation 1 even stronger:

Claim 1. For path $P$ and $\epsilon > 0$, if Push-Path$(x, P, \epsilon) > 0$ then

$|\{j \in P : x_j > a_j\}| - |\{j \in P : x_j \le a_j\}| = 1$ (2)

Claim 1 can be justified by noticing that otherwise, the sub-tree rooted in one of $v$'s children
would be amenable to path-improvements, contradicting optimality. The invariant follows, as we
could simply set $\epsilon'$ to be the minimum (positive) amount that maintains the number of nodes along
$P$ with $x$ values that are larger than their $a$ values. □

Lemma 1 implies that each push operation improves the value of the objective function for
the current sub-tree, while maintaining the optimality of the sub-trees rooted in the children of
$v$. However, in order to show that the local optimum obtained by the algorithm is the globally
optimal feasible solution, we need to argue that as long as the current assignment is not optimal,
there exists a feasible path-improvement with a corresponding $\epsilon > 0$ value. The following theorem,
which constitutes the main technical part of this paper, formalizes this notion.

Theorem 4.2. Upon termination of the inner while-loop, the sub-tree rooted in vertex $v$ is assigned
optimal $x$ values.

Proof. First, notice that the algorithm clearly maintains the feasibility of the solution throughout
its execution. The following observation follows from the definition of the algorithm.

Observation 2. During the execution of the algorithm, $x_v \ge a_v$. Furthermore, if $x_v = a_v$, the
solution is trivially optimal.

The proof of Theorem 4.2 will proceed as follows. We give the LP for the optimization problem,
and its corresponding dual LP. We then construct a feasible solution for the dual LP that satisfies
the complementary slackness conditions with respect to the solution of the algorithm. In order
to construct a valid dual solution, we inductively bootstrap the dual solutions constructed for the
sub-trees rooted at the children of $v$. From LP duality, we then conclude that the two solutions are optimal for
the primal and dual problems. Recall that we inductively assume that the sub-trees rooted in the
children of $v$ are optimally adjusted.

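It is not hard to write a linear program which formulates our problem. This program and its
dual can be seen below. The variables $d_i$ are introduced to avoid using absolute values in the
objective function. Here $p(i)$ denotes the parent of node $i$.

Primal LP:

\[
\begin{aligned}
\min \quad & \sum_{i \in T_v} d_i \\
\text{subject to} \quad & d_i + x_i \ge a_i && \forall i \in T_v && \text{(3a)} \\
& d_i - x_i \ge -a_i && \forall i \in T_v && \text{(3b)} \\
& x_i - \sum_{j \in C(i)} x_j \ge 0 && \forall i \in T_v && \text{(3c)} \\
& x_i \ge 0 && \forall i \in T_v && \text{(3d)}
\end{aligned}
\]

Dual LP:

\[
\begin{aligned}
\max \quad & \sum_{i \in T_v} a_i (\lambda_i - \lambda'_i) \\
\text{subject to} \quad & \lambda_i + \lambda'_i = 1 && \forall i \in T_v && \text{(4a)} \\
& (\lambda_i - \lambda'_i) + \alpha_i - \alpha_{p(i)} \le 0 && \forall i \in T_v \setminus \{v\} && \text{(4b)} \\
& (\lambda_v - \lambda'_v) + \alpha_v \le 0 && && \text{(4c)} \\
& \lambda_i, \lambda'_i, \alpha_i \ge 0 && \forall i \in T_v && \text{(4d)}
\end{aligned}
\]

Note the special case for vertex $v$ (inequality (4c)). By denoting $\beta_i = \lambda_i - \lambda'_i$, one can simplify the
dual LP:

\[
\begin{aligned}
\max \quad & \sum_{i \in T_v} a_i \beta_i \\
\text{subject to} \quad & -1 \le \beta_i \le 1 && \forall i \in T_v && \text{(5a)} \\
& \beta_i + \alpha_i - \alpha_{p(i)} \le 0 && \forall i \in T_v \setminus \{v\} && \text{(5b)} \\
& \beta_v + \alpha_v \le 0 && && \text{(5c)} \\
& \alpha_i \ge 0 && \forall i \in T_v && \text{(5d)}
\end{aligned}
\]

We now summarize the necessary complementary slackness conditions required by the dual:

\[
\begin{aligned}
x_i > a_i \;&\Rightarrow\; \lambda_i = 0,\ \lambda'_i = 1 \quad (\beta_i = -1) && \text{(C1)} \\
x_i < a_i \;&\Rightarrow\; \lambda_i = 1,\ \lambda'_i = 0 \quad (\beta_i = 1) && \text{(C2)} \\
x_i > \sum_{j \in C(i)} x_j \;&\Rightarrow\; \alpha_i = 0 && \text{(C3)} \\
x_i > 0 \;&\Rightarrow\; \lambda_i - \lambda'_i + \alpha_i - \alpha_{p(i)} = 0 && \text{(C4)}
\end{aligned}
\]

Since throughout the execution of the while loop $x_v \ge a_v$, and the case where $x_v = a_v$ is trivial, we
will assume from now on that $x_v > a_v$. This implies the last necessary condition:

\[
x_v > a_v \;\Rightarrow\; \alpha_v = 1 \qquad \text{(C5)}
\]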
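
To illustrate how a dual solution certifies optimality, the following Python sketch checks constraints (5a)-(5d) of the simplified dual on a tiny tree and compares the dual objective with the primal cost. The helper names, the tree encoding via a `parent` map, and the three-node instance are assumptions chosen for illustration; they are not taken from the paper.

```python
def dual_feasible(parent, root, alpha, beta, tol=1e-9):
    """Check constraints (5a)-(5d) of the simplified dual on a rooted tree."""
    for i in beta:
        if not (-1 - tol <= beta[i] <= 1 + tol):       # (5a)
            return False
        if alpha[i] < -tol:                             # (5d)
            return False
        above = 0.0 if i == root else alpha[parent[i]]
        if beta[i] + alpha[i] - above > tol:            # (5b), or (5c) at the root
            return False
    return True

def dual_value(a, beta):
    """Dual objective: sum of a_i * beta_i."""
    return sum(a[i] * beta[i] for i in a)

def primal_value(a, x):
    """Primal objective: sum of |a_i - x_i| (the optimal d_i for a fixed x)."""
    return sum(abs(a[i] - x[i]) for i in a)

# Hypothetical instance: root "v" with two leaf children, all targets equal to 1.
parent = {"l1": "v", "l2": "v"}
a = {"v": 1, "l1": 1, "l2": 1}
x = {"v": 2, "l1": 1, "l2": 1}          # feasible primal point of cost 1
alpha = {"v": 1, "l1": 0, "l2": 0}      # alpha_v = 1, as in condition (C5)
beta = {"v": -1, "l1": 1, "l2": 1}      # beta_v = -1, as in condition (C1)
```

In this instance the dual point is feasible and both objectives equal 1, so by weak duality no feasible primal assignment can cost less than 1.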
We begin by suggesting an initial assignment which might not be feasible, and in addition,
might violate one of the complementary slackness properties.

The following lemma is a direct consequence of the construction of the dual LP and the
complementary slackness constraints. It refers to a family of assignments to the dual LP that satisfy a
subset of the complementary slackness conditions.

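Lemma 2. Let $x, d$ be a feasible solution for the primal such that the sub-trees rooted in the children of $v$ are
optimally assigned and $v$ admits no Push-Path improvements. Let $\alpha, \beta$ be an assignment for the
dual variables such that the following holds:

\[
\alpha_i \begin{cases} = \alpha_{p(i)} - \beta_i, & \text{if } x_i > 0 \\ \le \alpha_{p(i)} - \beta_i, & \text{otherwise} \end{cases}
\qquad
\beta_i = \begin{cases} 1, & \text{if } x_i < a_i \\ -1, & \text{if } x_i > a_i \\ \text{a value in } [-1, 1], & \text{if } x_i = a_i \end{cases}
\qquad \text{(6)}
\]

Then $\alpha, \beta$ satisfy all the properties of a feasible dual solution, and $(\alpha, \beta)$ along with $(x, d)$
satisfy complementary slackness, except that $\alpha_i$ might be negative for some nodes, and condition (C3)
could be falsified.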

Next, we observe that if our modified dual LP admits an optimal feasible solution, then our
range of possible values for $\alpha, \beta$ can be narrowed due to the total unimodularity of the simplified
constraint matrix of the dual LP:

Observation 3. If the dual LP has an optimal and feasible solution, then it has an integral, feasible
and optimal solution, as well. In particular, for every $i \in T_v$, $\beta_i \in \{-1, 0, 1\}$.

Observation 3 can be verified by induction on the constraint matrix of the dual LP, in order to
show that every square sub-matrix of it has a determinant of 0 or ±1.

The following lemma complements Lemma 2 by suggesting a concrete assignment for each $\beta_i$
in the case where $x_i = a_i$.

Lemma 3. Consider an assignment as described in Lemma 2. If we set $\beta_i = 1$ whenever $x_i = a_i$,
then:

\[
\forall j \in T_v: \quad x_j > \sum_{k \in C(j)} x_k \;\Rightarrow\; \alpha_j \le 0
\]

Proof. We prove the claim by way of contradiction. Suppose that the claim is false, and let $j$ be
the highest node for which the claim does not hold. That is, $x_j > \sum_{k \in C(j)} x_k$ and $\alpha_j > 0$. Consider
$P_{j \to v}$, the path from $j$ to $v$. As we are trying to prove an upper bound for $\alpha_j$, we will assume that
for every node $k$ on the path from $v$ to $j$, $\alpha_k = \alpha_{p(k)} - \beta_k$, as lower values will only strengthen our
claim. This implies

$\alpha_j = -\sum_{k \in P_{j \to v}} \beta_k$ (7)

Since $\beta_i = 1$ for all nodes $i$ such that $x_i \le a_i$, and $\beta_i = -1$ otherwise, $\alpha_j > 0$ implies:

$|\{k \in P_{j \to v} : x_k > a_k\}| > |\{k \in P_{j \to v} : x_k \le a_k\}|$ (8)

This implies that we can reduce all $x$ values of nodes on $P_{j \to v}$ by an amount of at most $x_j - \sum_{k \in C(j)} x_k$
so as to get a feasible solution with a better objective function value. However, this is exactly a
push operation, thereby contradicting the assumption of no further push-paths. □
The following corollary is the contrapositive statement of Lemma 3.

Corollary 1. If there exists a node $j \in T_v$ such that $\alpha_j > 0$ and $x_j - \sum_{k \in C(j)} x_k > 0$, then there
exists an ancestor $i$ of $j$ such that $\beta_i \in \{0, -1\}$ and $x_i = a_i$.

We now prove the main theorem by way of induction. We inductively assume that the sub-trees
rooted in the children of $v$ have both an optimal setting for the primal LP and an integral and feasible
solution for the dual LP that satisfy the complementary slackness conditions. Without loss of
generality, we assume that no child $i$ of $v$ has $x_i = 0$, since otherwise, we could use its assignments
without any modifications, as $x_i$ does not harden the feasibility constraints of $v$.

Consider the assumed set of assignments for the sub-trees rooted in the children of $v$. By the assumption,
they have corresponding assignments to the dual LPs. Observe that since the conditions listed
in Lemma 2 are a subset of the complementary slackness conditions, Lemma 2 applies to them
automatically.

We will start from a tentative solution to the dual by initially setting $(\alpha, \beta)$ according to the
assumed assignments, and setting $\alpha_v = 1$, $\beta_v = -1$. We let $s_1$ denote the above assignment. Notice
that for each child $i$ of $v$, the dual LP that corresponds to the current assignment had

$\alpha_i + \beta_i = 0$,

as $i$ was the root (this is a strict equality as by our assumption $x_i > 0$). However, in the current
LP, the corresponding dual constraint becomes

$\beta_i + \alpha_i - \alpha_v = 0$.

As $\alpha_v = 1$, this equality is therefore violated. In order to rectify this, we first raise all the $\alpha$ values
(except $v$'s) by 1, and denote the resulting solution by $s_2$. Note that by the feasibility of the original
assignments to the sub-trees and by the definition of $s_2$, all the nodes in $T_v$ have non-negative $\alpha$
values. Also observe that $s_2$ now has all the properties listed in Lemma 2. Thus, by Lemma 2,
we can conclude that $s_2$ is a feasible solution to the dual LP, and $s_2$ along with $(x, d)$ satisfy
complementary slackness, except that complementary slackness condition (C3) might be violated.

Our next step is to adjust $s_2$ so as to fix any violation of condition (C3). Let
$W$ be the set of all infeasible nodes:

$W = \{j : x_j > \sum_{k \in C(j)} x_k \text{ and } \alpha_j > 0\}$ (9)

By Corollary 1, for each $j \in W$ there exists an ancestor $i$ such that (1) $x_i = a_i$ and (2) $\beta_i \in \{-1, 0\}$.
We let

$X = \{i \in T_v : x_i = a_i, \beta_i \in \{-1, 0\}\}$ (10)

Moreover, we let

$Y = \{i \in X : \text{there is no ancestor of } i \text{ in } X\}$ (11)

Thus, for each node $j \in W$ there exists an ancestor $i \in Y$.
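
The set $Y$ of (11) consists of the topmost members of $X$. A small Python sketch of that selection follows; the function name `topmost`, the `parent` map encoding, and the example tree are hypothetical and serve only to illustrate the definition.

```python
def topmost(parent, marked):
    """Return the members of `marked` that have no proper ancestor in `marked`."""
    result = set()
    for i in marked:
        j = parent.get(i)
        # Walk up towards the root until a marked ancestor is found or the root is passed.
        while j is not None and j not in marked:
            j = parent.get(j)
        if j is None:
            result.add(i)
    return result

# Hypothetical tree: a is the root, b and d are its children, c is a child of b.
parent = {"b": "a", "c": "b", "d": "a"}
X = {"b", "c", "d"}
Y = topmost(parent, X)
```

Since no member of the result is an ancestor of another, the sub-trees rooted at its members are pairwise disjoint, which is the property the argument relies on.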

We now define the final solution to the dual LP. Define assignment $s_3$ to the dual LP for $T_v$ by
taking solution $s_2$ with the following modifications:

1. $\forall k \in T_i$ such that $i \in Y$, decrease $\alpha_k$ by 1.

2. $\forall i \in Y$, add 1 to $\beta_i$.

Increasing the $\beta_i$ values by 1 makes sure that complementary slackness condition (C4) is satisfied
after applying the first step. Applying the first modification step guarantees that complementary
slackness condition (C3) is again satisfied, as all nodes in $W$ undergo the first modification. Observe
that by definition, all the sub-trees rooted in nodes in $Y$ are pairwise disjoint. Hence, each $\alpha$ value
can be decremented at most once. Also observe that in addition to nodes in $W$, other nodes may
have their $\alpha$ values decremented. However, by the definition of $W$, these nodes do not need to
maintain condition (C3), and thus this step will not violate their constraints. In addition, their $\alpha$
values are guaranteed to remain non-negative as they were previously incremented by 1.

In conclusion, all of the complementary slackness conditions for the dual LP now hold for $(x, d)$
and $s_3$. Therefore, $(x, d)$ is an optimal solution for $T_v$. □

5 The Algorithm 

Given the general technique presented in Algorithm 1, we conclude our results by giving an $O(n^2)$
algorithm that follows the spirit of improving by pushing the surplus from a given vertex downward,
along a path. Recall that the algorithm Push-Improve performs the push operations one path at
a time. Instead, we can leverage the fact that some paths can share the same prefix. Specifically,
instead of the inner while-loop, executed for each node $v$ in the tree, we introduce a depth-first-search
algorithm in which, for each node $j \in T_v$, the algorithm remembers the maximal amount
pushable through $P_{v \to j}$.

We make use of two measures, defined for each node $u \in T_v$. Let $\delta_u = |\{j \in P_{v \to u} : x_j > a_j\}| - |\{j \in P_{v \to u} : x_j \le a_j\}|$.
In other words, for any push operation along $P_{v \to u}$, $\delta_u$ is the difference
between the number of nodes that will improve the objective function value, and the number of
nodes that will worsen the objective function value, if we push a small enough value through $P_{v \to u}$.
Additionally, we define the positive bottleneck along $P_{v \to u}$ as $\epsilon_u = \min_{j \in P_{v \to u} : x_j > a_j} (x_j - a_j)$.
This is the maximum $\epsilon$ we can push on the path $P_{v \to u}$ while gaining exactly $\delta_u \cdot \epsilon$ in the objective
function. In order to maintain feasibility, we restrict $\epsilon_u$ to be no more than $x_k$, for any node $k$
on $P_{v \to u}$. This value will have a similar function to the $\epsilon$ value given in Algorithm 1. That is,
for the current node $v$, and a successor $u$, $\epsilon_u$ will serve as the amount of excess we push through
$P_{v \to u}$. Our algorithm will maintain feasibility by restricting the decrease in $x_u$ to the sum of the
decreases made on the direct children of $u$ (unless $x_u$ was strictly bigger than the sum of the $x$'s of
its children before the decrease).
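
The two measures can be computed for all nodes of a sub-tree in a single depth-first traversal, since the values for a node follow from those of its parent. The Python sketch below does this with the update rule of the Set-Params procedure (Appendix A) specialised to unit weights; the function name, the dictionary encoding, and the example chain are assumptions made for illustration.

```python
import math

def path_parameters(children, a, x, v):
    """For every u in the sub-tree of v, compute (delta_u, eps_u) along the path v -> u.

    delta_u: #{j on the path : x_j > a_j} - #{j on the path : x_j <= a_j}
    eps_u:   min over the path of (x_j - a_j) where x_j > a_j, and of x_j otherwise
    """
    params = {}
    stack = [(v, 0, math.inf)]
    while stack:
        u, delta, eps = stack.pop()
        if x[u] > a[u]:
            delta, eps = delta + 1, min(eps, x[u] - a[u])
        else:
            delta, eps = delta - 1, min(eps, x[u])
        params[u] = (delta, eps)
        # Children inherit the values accumulated on the shared path prefix.
        stack.extend((c, delta, eps) for c in children.get(u, []))
    return params

# Hypothetical chain 1 -> 2 -> 3 with targets 0, 0, 1 and every x equal to 1.
children = {1: [2], 2: [3]}
a = {1: 0, 2: 0, 3: 1}
x = {1: 1, 2: 1, 3: 1}
params = path_parameters(children, a, x, 1)
```

For the leaf of this chain the balance is 1 and the bottleneck is 1, so pushing one unit from vertex 1 down to the leaf improves the objective by one unit. Each call visits every node of the sub-tree once, which matches the quadratic total cost of one traversal per vertex.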

The final algorithm for optimizing the assignment to $T_v$ can be seen as Algorithm 5 in
Appendix E. It differs from Algorithm 1 in the way the sub-tree $T_v$ is modified for each node $v$.

The following theorem states that this algorithm is optimal.

Theorem 5.1. When the algorithm leaves node $v \in V$, there is no push-path that starts at the root $r$,
ends at a leaf, and passes through $v$.

Proof. First, we note the following observation, which suggests that the potential for improvement
on any path from the root to a node $v$ cannot increase.

Observation 4. Let $u$ be a node in $T$. Let

$\delta^*_u = |\{i \in P_{r \to u} : x_i > a_i\}| - |\{i \in P_{r \to u} : x_i \le a_i\}|$.

$\delta^*_u$ does not increase throughout the execution of the algorithm.

The observation follows immediately from the fact that the only modifications to the $x$ values
of the nodes are decreases.

We proceed to prove the theorem by induction on the height $h$ of the node $v$. For $h = 0$ (leaves),
the claim is trivial. Assume the claim holds for $h = k$, and let $v$ be a node of height $k + 1$. The
claim follows immediately from Observation 4: no sub-tree rooted in a child of $v$ can be improved
as a result of a push-path through it. Additionally, the path from $r$ to $v$ never becomes amenable to
improvements through push operations, once the algorithm leaves $v$. This concludes the proof. □

Running Time. The algorithm essentially performs a depth-first-search for every node $v$ of the
tree. Therefore, the running time of the algorithm is $O(n^2)$.

6 Conclusions and Future Work 

We have demonstrated the technical difficulties that our problem entails, as well as an efficient 
method for handling a broad class of instances of the problem. Due to their high efficiency, our 
methods can be run on relatively large instances in practice. We also believe that our algorithm 
might be applicable to settings beyond recommendation systems. 

An immediate open question is to extend our algorithm to the case of general DAGs. It seems 
that one needs some new ideas to give a combinatorial algorithm for this general case. In fact even 
a (fast) approximation algorithm for this case seems to be beyond the reach of our techniques. 
Another interesting direction would be to consider other measures such as the $\ell_2$-norm. Due to
the fundamental difference between the $\ell_1$ and $\ell_2$ norms, we suspect that this different distance
measure will require a completely different approach. 

In addition to considering alternative objective functions, we can also consider other constraints. 
For instance, we can consider comparing the value assigned to each node to the average value of 
its children. Another type of constraint would be to require equality between the value of a node, 
and the sum of the values of its children.

A The Weighted $\ell_1$-Norm Case

We now discuss the case where the nodes on the given tree can have varying levels of importance 
with respect to the objective function. Specifically, as done in related studies, we consider the case 
in which for each node $i \in V$, there is an associated weight $w_i$. For ease of presentation, we assume that all weights are integral, i.e. $w_i \in \mathbb{N}$ for every $i$. The objective function will be $g(x) = \sum_{i \in V} w_i \cdot |x_i - a_i|$.

Hence, we can simply reinterpret the variable $\delta_i$ defined for the unweighted algorithm as the weighted balance between the nodes which will benefit and the nodes that will "suffer" as a result of a Push-Path operation. More precisely, when considering the path $P_{v \to i}$, we will compute $\delta_i = \sum_{j \in P_{v \to i} : x_j > a_j} w_j - \sum_{j \in P_{v \to i} : x_j \le a_j} w_j$. Therefore, a feasible push-improvement across a path $P_{v \to u}$ would lead to an improvement if and only if $\delta_u > 0$.
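The weighted balance can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names `weighted_balance` and `push_gain` are hypothetical, a path is given as a list of node indices, and the step is assumed small enough that no node on the path crosses its target value.

```python
def weighted_balance(path, x, a, w):
    """Weight of the nodes with x > a minus the weight of the remaining nodes."""
    delta = 0
    for j in path:
        if x[j] > a[j]:
            delta += w[j]  # lowering x[j] moves it toward its target
        else:
            delta -= w[j]  # lowering x[j] moves it away from its target
    return delta


def push_gain(path, x, a, w, step):
    """Decrease of the weighted l1 objective when every x[j] on the path drops by step."""
    before = sum(w[j] * abs(x[j] - a[j]) for j in path)
    after = sum(w[j] * abs(x[j] - step - a[j]) for j in path)
    return before - after
```

For such a small step the gain equals the step times the balance, which is why the sign of the balance decides whether a push helps.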

The above discussion leads to the following simple modification to procedure Set-Params, given in Algorithm 2. Note that the weighted case reduces to the unweighted case by setting the weights to 1.

Algorithm 2: The modified Set-Params procedure for the weighted case
Input: Vertex v

Set-Params(Vertex $i$, Non-negative integer $\delta$, Non-negative real value $\epsilon$)
begin
    if $x_i > a_i$ then
        $\delta_i \leftarrow \delta + w_i$, $\epsilon_i \leftarrow \min\{\epsilon, x_i - a_i\}$
    end
    else
        $\delta_i \leftarrow \delta - w_i$, $\epsilon_i \leftarrow \min\{x_i, \epsilon\}$
    end
end

Clearly, the modified algorithm has the same $O(n^2)$ running time as the original algorithm. The following theorem argues about the optimality of the modified algorithm:

Theorem A.1. The algorithm resulting from the modification given in Algorithm 2 obtains the optimal weighted-$\ell_1$ objective function value.

Proof. In order to argue about the correctness of the modified algorithm, we compare the objective 
function obtained by the algorithm to the one obtained by the original algorithm, on an equivalent 
unweighted tree. 

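The construction. We construct the tree $\tilde{T} = (\tilde{V}, \tilde{E})$ by replacing each node $i$ with a chain $i_1, \dots, i_{w_i}$, such that for any $1 \le j < w_i$, $(i_j, i_{j+1}) \in \tilde{E}$. Additionally, for each child $k \in C(i)$ we set $(k_{w_k}, i_1) \in \tilde{E}$, and for $i$'s parent $\ell$ we set $(i_{w_i}, \ell_1) \in \tilde{E}$.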

It is easy to see that $\tilde{T}$ is a tree. Notice that $\tilde{T}$ might be arbitrarily large (according to the weights). However, it is used only for the sake of the proof of correctness, and is never actually constructed by the algorithm. The following proof sketch highlights the equivalence of the uniform weight case to the weighted case.
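The chain construction can be sketched in Python as follows. This is an illustrative sketch, not part of the algorithm: the names `expand_to_chains`, `weighted_cost` and `unweighted_cost` are hypothetical, the tree is given as a parent map, weights are assumed to be positive integers, and every copy of a node is assumed to keep that node's target value.

```python
def expand_to_chains(parent, a, w):
    """Replace node i by a chain i_1..i_{w_i}; return (parent map, targets) of the new tree.

    parent[i] is the parent of i (None for the root). New nodes are pairs (i, j)
    with 1 <= j <= w[i]; (i, 1) is the bottom of the chain and (i, w[i]) its top.
    """
    new_parent, new_a = {}, {}
    for i in parent:
        for j in range(1, w[i] + 1):
            new_a[(i, j)] = a[i]  # assumption: every copy keeps the target a_i
            if j < w[i]:
                new_parent[(i, j)] = (i, j + 1)      # edge (i_j, i_{j+1})
            elif parent[i] is None:
                new_parent[(i, j)] = None            # top of the root's chain
            else:
                new_parent[(i, j)] = (parent[i], 1)  # edge (i_{w_i}, l_1)
    return new_parent, new_a


def weighted_cost(x, a, w):
    """Weighted l1 objective on the original tree."""
    return sum(w[i] * abs(x[i] - a[i]) for i in x)


def unweighted_cost(x, a):
    """Plain l1 objective on the expanded tree."""
    return sum(abs(x[v] - a[v]) for v in x)
```

Copying an assignment of the weighted tree onto every node of the corresponding chain gives an assignment of the expanded tree with the same objective value, which is the correspondence the proof relies on.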

Claim 2. Let $x$ and $\tilde{x}$ be optimal assignments for $T$ and $\tilde{T}$, respectively. Then $g(x) = f(\tilde{x})$.

The following immediate observation, which follows from the construction of $\tilde{T}$, implies the above claim.

Observation 5. Let $\tilde{x}$ be an optimal assignment for $\tilde{T}$. Then for any chain $(i_1, \dots, i_{w_i})$ that corresponds to vertex $i$ in $T$:

$$\tilde{x}_{i_1} = \tilde{x}_{i_2} = \dots = \tilde{x}_{i_{w_i}}.$$

The following claim complements Claim 2.

Claim 3. Let $\tilde{x}$ and $x$ be the feasible assignments returned by the original algorithm on $\tilde{T}$ and by the modified algorithm for weighted trees on $T$, respectively. Then

$$g(x) = f(\tilde{x}).$$

□



B The $\ell_\infty$-norm

We now turn our attention to the case of the $\ell_\infty$-norm, i.e. minimizing the maximal difference $\max_{i \in V} |x_i - a_i|$. In contrast to the case of the $\ell_1$-norm, this optimization problem can be solved in a straightforward manner by using dynamic programming, even when the underlying graph is a directed acyclic graph.

For a given value $t \ge 0$, the algorithm goes over all nodes and tries to produce an assignment of objective value at most $t$. We can show that if the algorithm fails, then there is no valid assignment of objective value at most $t$. To find the optimal objective value, one then only needs to run a binary search on the variable $t$.



Algorithm 3: The dynamic programming algorithm for the $\ell_\infty$-norm case
Input: DAG $G = (V, E)$, vertices $1, \dots, n$ sorted in topological order, vertex weight vector $a$.

for $i \leftarrow 1$ to $n$ do
    $x_i^{\min} \leftarrow \max\{0, \sum_{j \in C(i)} x_j^{\min}, a_i - t\}$
end
return $x^{\min}$
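A minimal Python sketch of Algorithm 3 combined with the search over $t$ is given below. The names `min_assignment`, `linf_error` and `smallest_threshold` are hypothetical, vertices are numbered from 0, the ordering is assumed to place every child before its parents, and a fixed number of bisection steps over real values stands in for the binary search.

```python
def min_assignment(children, a, t):
    """One pass of the dynamic program for a fixed threshold t.

    Vertices are 0..n-1, ordered so that every child has a smaller index than
    its parents; children[i] lists the children of vertex i.
    """
    x_min = [0.0] * len(a)
    for i in range(len(a)):
        child_sum = sum(x_min[j] for j in children[i])
        x_min[i] = max(0.0, child_sum, a[i] - t)
    return x_min


def linf_error(x, a):
    """Maximal absolute difference between assigned and target values."""
    return max(abs(xi - ai) for xi, ai in zip(x, a))


def smallest_threshold(children, a, iterations=50):
    """Bisect on t in [0, sum(a)] for the smallest t whose pass stays within t."""
    lo, hi = 0.0, float(sum(a))
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if linf_error(min_assignment(children, a, mid), a) <= mid:
            hi = mid  # threshold mid is achievable, try a smaller one
        else:
            lo = mid  # the pass overshoots some target by more than mid
    return hi
```

The pass never assigns a value below $a_i - t$, so only overshooting a target can make a threshold fail, and the feasibility check reduces to comparing the error of the returned assignment with $t$.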



As mentioned above, we perform a binary search on $t$ in the range $[0, \sum_j a_j]$. Clearly, for an instance of the problem with optimal solution value $r$, the running time of Algorithm 3 would be $O(n \cdot \log r)$. We now briefly outline the proof of correctness of the algorithm.

Theorem B.1. For any given $t \ge 0$, $x = x^{\min}$ is a valid solution. Furthermore, if $\|x - a\|_\infty > t$, then there does not exist a valid solution $x'$ such that $\|a - x'\|_\infty \le t$.

Proof. The validity of $x$ follows from the definition. To prove the second part we show the following simple lemma.

Lemma 4. If $x'$ is a valid solution and $\|a - x'\|_\infty \le t$, then for all $i$, $x'_i \ge x_i^{\min}$.

Proof. The proof follows by a simple induction on $i$. Note that because $\|a - x'\|_\infty \le t$, $x'_i \ge a_i - t$. Furthermore, $x'$ is a valid solution so $x'_i \ge 0$ and

$$x'_i \ge \sum_{j \in C(i)} x'_j \ge \sum_{j \in C(i)} x_j^{\min},$$

where we have used the induction hypothesis for the second inequality. It then follows that

$$x'_i \ge \max\Big\{0, \sum_{j \in C(i)} x_j^{\min}, a_i - t\Big\} = x_i^{\min}.$$

□

Now assume that there is a valid solution $x'$ with objective value at most $t$. It follows that for all $i$,

$$a_i - t \le x_i^{\min} \le x'_i \le a_i + t,$$

that is, $\|x^{\min} - a\|_\infty \le t$. □

C Hardness of Approximation in General Graphs

As mentioned before, when the objective value is the $\ell_1$ norm of the difference between the $x$ and $a$ vectors, the (most general case of the) problem can be solved exactly in polynomial time by solving a linear program. In fact, it is not hard to see that using a similar approach one can solve this general case of the problem for any $\ell_p$ norm with $1 < p < \infty$. The only difference is that one has a linear program with infinitely many facets which has an efficient separation oracle and can be solved with the Ellipsoid method.

For an instance where all the input values are integral, one might ask whether the task of finding an optimal integral solution is tractable or not. This is especially interesting for the $\ell_1$ case, since in the case of trees, an integral solution can be found efficiently by our combinatorial algorithm if the initial $a$ values are integral. Unfortunately, as soon as one considers the DAG case (even the special case of layered DAGs) this problem becomes intractable for essentially all $\ell_p$ norms. The following theorem summarizes our hardness results.

Theorem C.1. Unless $NP \subseteq TIME(n^{O(\log \log n)})$, it is NP-hard to approximate the Integral Isotonic Regression problem for the case of directed acyclic graphs better than $O((\log n)^{1/p})$ for the $\ell_p$ norm.

Proof. We prove the theorem by a reduction from the Set Cover problem. In the Set Cover problem one is given sets $S_1, S_2, \dots, S_m$ such that $S_1 \cup S_2 \cup \dots \cup S_m = \{1, 2, \dots, n\}$ and the objective is to select a minimum number of $S_i$'s such that their union is still $\{1, 2, \dots, n\}$. It is a well known result of Feige [Fei98] that unless $NP \subseteq TIME(n^{O(\log \log n)})$ it is NP-hard to approximate Set Cover better than a factor of $(1 - o(1)) \log n$. Our reduction uses vertex weights so as to simplify the construction. However, one can easily adapt the construction to the uniform case by adding multiple copies of nodes so as to simulate large weights.

Given an instance of the Set Cover problem, we construct the following instance of the SBHSP problem:

• The vertex set of the output digraph will be $V = \{v_1, \dots, v_m, u_1, \dots, u_n\}$.

• The edge set of the output digraph will be $E = \{(v_i, u_j) : j \in S_i\}$.

• The $a$ values on the vertices will be as follows. For all $v_i$ we have $a(v_i) = 1$, while for all $u_j$ we have $a(u_j) = |\{i : j \in S_i\}| - 1$.

• The $w$ values (weights) of the vertices will be as follows. For all $v_i$ we have $w(v_i) = 1$, while for all $u_j$ we have $w(u_j) = m$.

On the one hand it is easy to see that for any set cover of the original instance (of size $\alpha$) one can construct a solution to the SBHSP instance (of cost $\sqrt[p]{\alpha}$) by assigning $x(v_i) = 0$ if $S_i$ is selected and 1 otherwise, and $x(u_j) = a(u_j)$ for all $u_j$.
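The reduction and the solution derived from a cover can be sketched in Python as follows. This is only an illustration: the names `build_instance`, `assignment_from_cover`, `is_valid` and `cost` are hypothetical, sets are indexed from 0 while elements are numbered from 1, and the weighted $\ell_p$ cost is assumed to place each weight inside the sum as a multiplier of $|x - a|^p$.

```python
def build_instance(sets, n):
    """Build the SBHSP instance for the given sets over the elements 1..n.

    Vertices are ("v", i) for sets and ("u", j) for elements; the children of
    u_j are the v_i with j in S_i. Returns (children, a, w).
    """
    m = len(sets)
    children, a, w = {}, {}, {}
    for i in range(m):
        children[("v", i)] = []
        a[("v", i)] = 1
        w[("v", i)] = 1
    for j in range(1, n + 1):
        containing = [("v", i) for i in range(m) if j in sets[i]]
        children[("u", j)] = containing
        a[("u", j)] = len(containing) - 1
        w[("u", j)] = m
    return children, a, w


def assignment_from_cover(cover, children, a):
    """x(v_i) = 0 if S_i is selected and 1 otherwise; x(u_j) = a(u_j)."""
    x = {}
    for v in children:
        if v[0] == "v":
            x[v] = 0 if v[1] in cover else 1
        else:
            x[v] = a[v]
    return x


def is_valid(x, children):
    """Every value is non-negative and at least the sum of its children's values."""
    return all(x[v] >= 0 and x[v] >= sum(x[c] for c in children[v]) for v in children)


def cost(x, a, w, p):
    """Weighted l_p cost; placing the weight inside the sum is an assumption."""
    return sum(w[v] * abs(x[v] - a[v]) ** p for v in x) ** (1.0 / p)
```

A selection that misses some element $j$ leaves all children of $u_j$ at value 1, so their sum exceeds $a(u_j)$ and the assignment is invalid; this is how validity encodes coverage.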

On the other hand it is not hard to see that the optimal solution to the SBHSP will have $x(v_i) \in \{0, 1\}$ for all $i$ and $x(u_j) = a(u_j)$ for all $j$. Furthermore, for any such solution (of cost $\alpha$) the set of $S_i$ for which $x(v_i) = 0$ can be easily seen to be a valid set cover of size $\alpha^p$. Hence, a hardness of approximation of $(1 - o(1)) \log n$ for the Set Cover problem implies a hardness of approximation of $O(\sqrt[p]{\log n})$ for SBHSP when the objective value is defined using the $\ell_p$ norm for any $1 < p < \infty$. □

Remark 1. The hard instances of Set Cover generated by Feige [Fei98] have fewer sets than elements. As a result it is not hard to see that the hardness achieved by the proof of the above theorem is in fact $((1 - o(1)) \ln n)^{1/p}$ for the weighted case and ( ^^ o(i))lnn \ non-weighted case.



D FPTAS for optimizing under the $\ell_1$ norm for Bilayered graphs

Consider a DAG $G = (V, E)$ which is bilayered, i.e. the vertex set can be partitioned as $V = U \cup W$ and each edge is from the $U$ side to the $W$ side ($E \subseteq U \times W$). In this section we show a fast Fully Polynomial Time Approximation Scheme for SBHSP with the $\ell_1$ norm for such DAGs. The run time will be close to linear in the size of the DAG. The algorithm is a simple reduction to a well known class of problems which admit such FPTASes. These problems are a restricted class of the Mixed Positive Packing and Covering Problem, see [Fle04].

We start with the following simple observation.

Lemma 5. When optimizing the $\ell_1$ norm and when the input DAG is bilayered there is always an optimal solution with the following two properties: (i) $\forall w \in W : x_w = a_w$, (ii) $\forall u \in U : x_u \le a_u$.

Proof. Consider any optimal solution $x$, and a vertex $w \in W$ for which $x_w \ne a_w$. If $x_w < a_w$, changing the assigned value of this vertex to $a_w$ produces another valid solution with a better objective function value. Now consider the case in which $x_w > a_w$. If $x_w > \sum_{u \in C(w)} x_u$ then again we can improve the objective function by decreasing $x_w$ slightly, and if $x_w = \sum_{u \in C(w)} x_u$ we can simultaneously decrease $x_w$ and the assigned value of some of its children. This last step would help the objective value due to the improvement on $w$ and possibly hurt it by the exact same amount due to the decrease on its children, while maintaining a valid solution. Doing this step on every node on the $W$ side results in a solution that satisfies the first condition.

For the second condition observe that if $x_u > a_u$ we can simply decrease it to $a_u$ without changing the validity of the solution while decreasing the objective function. In other words any optimal solution must satisfy the second condition. □

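Given the above lemma one can write the following linear program whose solution is the exact value of the optimal solution. The left hand side is the original program based on the LP (13a)-(13d) while the right hand side is the result of a simplification.

$$
\begin{array}{lll}
\min & \sum_{u \in U} d_u & \\
\text{subject to} & d_u \ge 0 & \forall u \in U \\
& x_u \ge 0 & \forall u \in U \\
& d_u + x_u = a_u & \forall u \in U \\
& \sum_{u \in C(w)} x_u \le a_w & \forall w \in W
\end{array}
\qquad
\begin{array}{llll}
\min & \sum_{u \in U} d_u & & \\
\text{subject to} & d_u \ge 0 & \forall u \in U & \text{(13a)} \\
& d_u \le a_u & \forall u \in U & \text{(13b)} \\
& \sum_{u \in C(w)} d_u \ge \big(\sum_{u \in C(w)} a_u\big) - a_w & \forall w \in W & \text{(13c)}
\end{array}
$$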
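The simplification amounts to eliminating $x_u$. The short derivation below is a sketch of that step; it uses only the equality $d_u + x_u = a_u$ of the original program, so that $x_u = a_u - d_u$:

$$
\begin{aligned}
x_u \ge 0 \;&\Longleftrightarrow\; a_u - d_u \ge 0 \;\Longleftrightarrow\; d_u \le a_u, \\
\sum_{u \in C(w)} x_u \le a_w \;&\Longleftrightarrow\; \sum_{u \in C(w)} (a_u - d_u) \le a_w \;\Longleftrightarrow\; \sum_{u \in C(w)} d_u \ge \Big(\sum_{u \in C(w)} a_u\Big) - a_w.
\end{aligned}
$$

The first line gives the packing constraint (13b) and the second gives the covering constraint (13c); the constraint $d_u \ge 0$ is carried over unchanged as (13a).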

Once written in this form the above formulation is a so-called Mixed Positive Packing and Covering Program. In fact it is among a certain class of such programs for which Fleischer [Fle04] provides a fast FPTAS. In particular, we have the following theorem.

Theorem D.1. When the input is a bilayered graph and the objective value is in terms of the $\ell_1$ norm, there is an algorithm that given $\epsilon > 0$ runs in time $O(|V||E|\log(|V|)/\epsilon^2)$ and returns a valid solution with objective value no more than $(1 + \epsilon)$ times that of the optimum.

Proof. The proof is a simple application of Theorem 2.1 from [Fle04] to the above linear program. A simple corollary of that theorem is that the algorithm finishes in $O(|V||E|\log(|V|)/\epsilon^2)$ steps¹ and in each step one has to find the most unsatisfied constraint among (13a)-(13c) given a current solution $d$. Each such step can be done by evaluating all the constraints in total time $|E|$. □

E Omitted figures and algorithms 



Figure 1: A counter-example for the naive algorithm. Node annotations denote the a values. The 
naive algorithm will obtain an objective function value of 4, whereas the optimum value is 2. 







¹ The constant $C$ in Theorem 2.1 of [Fle04] is 1 in our case.

Algorithm 4: The improved ImproveSubtree procedure
Input: Vertex v

begin
    /* If $x_v = a_v$, $T(v)$ is optimal */
    if $x_v > a_v$ then
        Push-Search($v$, $\infty$, 0)
    end
end

Push-Search(Vertex $u$, Non-negative real value $\epsilon$, Non-negative integer $\delta$)
begin
    Set-Params($u$, $\delta$, $\epsilon$)
    if $\epsilon_u = 0$ then
        return 0
    end
    sum $\leftarrow 0$
    $\ell \leftarrow \min\{\epsilon_u, x_u - \sum_{k : \text{child of } u} x_k\}$
    if $\ell > 0$ and $\delta_u > 0$ then
        $x_u \leftarrow x_u - \ell$, sum $\leftarrow \ell$
        Set-Params($u$, $\delta$, $\epsilon - \ell$)
    end
    foreach $j \in C(u)$ do
        if sum $= \epsilon_u$ then
            return sum    /* Speedup */
        end
        $t \leftarrow$ Push-Search($j$, $\epsilon_u$, $\delta_u$)
        sum $\leftarrow$ sum $+ t$, $x_u \leftarrow x_u - t$
        Set-Params($u$, $\delta$, $\epsilon -$ sum)
    end
    return sum
end

Set-Params(Vertex $i$, Non-negative integer $\delta$, Non-negative real value $\epsilon$)
if $x_i > a_i$ then
    $\delta_i \leftarrow \delta + 1$, $\epsilon_i \leftarrow \min\{\epsilon, x_i - a_i\}$
end
else
    $\delta_i \leftarrow \delta - 1$, $\epsilon_i \leftarrow \min\{x_i, \epsilon\}$
end